In this practice we will learn how to clean and prepare datasets for Machine Learning (this process is called Data Cleansing).
We will also learn One-Hot Encoding and how to divide the data into parts (train-validation-test).

Link: https://plotly.com/
Plotly's Python graphing library makes interactive, publication-quality graphs.
Plotly supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.
The package is free and open source, and you can view the source, report issues, or contribute on GitHub.
First, we want to check the Plotly version installed on the system.
# show plotly version
!pip show plotly
This version (4.4.1) is old; we want to update it so we will have the new features of Plotly (like the new sunburst chart).
# update plotly version
!pip install --upgrade plotly
# import numpy, matplotlib, etc.
import numpy as np
import pandas as pd
# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import model_selection
We will use the insurance dataset, loaded from GitHub.
With this dataset we want to predict the individual medical costs billed by health insurance.
We can read more about the dataset on Kaggle.
We can also see a summary of the dataset's columns at the bottom of the Kaggle page:

We could grab the dataset from Kaggle's servers, but it is simpler to download it from GitHub (Kaggle requires an account on its site).
Let's download the dataset from GitHub with the Linux command wget.
# download insurance.csv file from GitHub
!wget https://github.com/stedy/Machine-Learning-with-R-datasets/raw/master/insurance.csv
# load the insurance csv file
insurance_df = pd.read_csv('insurance.csv')
insurance_df
We have 7 columns; 6 features and 1 label.
The Features:
The Target:
Some of our features are categorical (sex, smoker and region).
Categorical features are features that have no intrinsic order between their values (smoker, non-smoker).
Some of our features are ordinal (children).
Ordinal features are similar to categorical features, but there is an order between the values (1 child, 2 children, etc.).
In ordinal features, there is no meaning to the values in between (there is no such thing as 1.5 children).
Some of our features are numerical (bmi, age and the target charges).
Numerical features are like ordinal features, but there is meaning to the values in between (bmi is a scale, there is meaning to each fraction).
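The three kinds of features can be expressed with pandas dtypes — a small sketch with made-up values (the column names mirror our dataset, but the data here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'smoker': ['yes', 'no', 'no'],   # categorical: no order between values
    'children': [0, 2, 1],           # ordinal: ordered, no in-between values
    'bmi': [27.9, 33.8, 22.5],       # numerical: fractions are meaningful
})

# categorical: unordered category dtype
df['smoker'] = pd.Categorical(df['smoker'])
# ordinal: category dtype with an explicit order between the values
df['children'] = pd.Categorical(df['children'], categories=[0, 1, 2], ordered=True)

print(df['smoker'].cat.ordered)    # False
print(df['children'].cat.ordered)  # True
print(df['children'].max())        # 2 -- order comparisons are now meaningful
```

Marking the dtype this way is optional, but it documents the intended semantics of each column.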
Let's start with cleansing the dataset.
The first thing to do is to check for empty values.
Empty values can be '' in string columns, or NaN values.
# detect np.NaN values in the df
np.where(insurance_df.isnull())
There are no empty values.
Let's insert a couple of empty rows into the data.
# add empty rows to the df
insurance_df_cp = insurance_df.copy()
insurance_df_cp.loc[len(insurance_df)] = [np.NaN, "", np.NaN, None, None, "", np.NaN]
insurance_df_cp.loc[len(insurance_df_cp)] = [np.NaN, "", np.NaN, np.NaN, None, None, np.NaN]
insurance_df_cp
Real-life data will have empty values.
When we get a new dataset, we need to fill the empty values, or remove the rows/columns that contain them.
In this practice we will fill the values (we don't want to lose valuable data).
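For completeness, the removal alternative is a one-liner in pandas — a quick sketch on a toy frame (we will not use it here, since we prefer to keep the rows):

```python
import numpy as np
import pandas as pd

# a tiny frame with an empty value in each column
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

# drop every row that contains at least one empty value
rows_dropped = df.dropna()
# drop every column that contains at least one empty value
cols_dropped = df.dropna(axis=1)

print(len(rows_dropped))           # 1 -- only the first row is fully filled
print(list(cols_dropped.columns))  # [] -- both columns contain an empty value
```

As the toy frame shows, dropping can throw away a lot of data when the empty values are spread across rows and columns.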
Let's check the types of the columns.
# print the type of the columns
insurance_df_cp.dtypes
When a column is of type float64, we know that it holds floating point numbers.
So, in this column, the only possible empty value is np.NaN.
When a column is of type object, it can hold strings as well as numbers and None values.
So, in this column, the possible empty values are np.NaN, "" and None.
The type hierarchy is: int64 < float64 < object.
When there is at least one float64 element in the column, the column type will be float64.
When there is at least one object element in the column, the column type will be object.
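We can watch this hierarchy in action — a small demo showing how a single NaN upcasts an integer column to float64, and a single string upcasts it to object:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# one NaN forces the whole column to float64
s_with_nan = pd.Series([1, 2, np.nan])
print(s_with_nan.dtype)  # float64

# one string forces the whole column to object
s_with_str = pd.Series([1, 2, 'three'])
print(s_with_str.dtype)  # object
```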
Let's translate all the empty values to np.NaN values (Pandas works best with these values).
# replace all empty values to np.NaN values
insurance_df_cp.replace('', np.NaN, inplace=True)
insurance_df_cp.fillna(np.NaN, inplace=True)
insurance_df_cp
Let's see the empty values (rows, cols).
# detect np.NaN or None values in the copy of df
print(f'There are {len(np.where(insurance_df_cp.isnull())[0])} empty values in the dataframe:')
print(np.where(insurance_df_cp.isnull()))
We can count how many empty values we have in each column.
# count empty values in each column
def count_empty_values_in_each_column(df):
    print('empty values:')
    for column in df.columns:
        print(f'`{column}`: {df[column].isnull().sum()}')
count_empty_values_in_each_column(insurance_df_cp)
There isn't a single correct way of completing these values.
There are a few options for this:
Ordinal features are something between categorical and numerical features, so any of the methods can work for them.
To complete the empty values in each column, we need to get some data on the column.
Let's start with the categorical columns.
We can show the distribution of each column with pie charts.
# import px and create pie charts for each categorical feature
import plotly.express as px
def create_pie_chart_of_count(df, column_name):
    df_not_null = df[~df[column_name].isnull()]
    fig = px.pie(df_not_null.groupby([column_name]).size().reset_index(name='count'),
                 names=column_name, values='count')
    fig.show()
create_pie_chart_of_count(insurance_df, 'sex')
create_pie_chart_of_count(insurance_df, 'region')
create_pie_chart_of_count(insurance_df, 'smoker')
create_pie_chart_of_count(insurance_df, 'children')
We can show all the plots as subplots with graph objects and pie charts.
# import go and make_subplots and create pie charts subplots of the categorical features
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def create_pie_chart_subplot_of_count(df, columns_names):
    rows = int(np.ceil(np.sqrt(len(columns_names))))
    cols = int(np.ceil(len(columns_names) / rows))
    fig = make_subplots(rows=rows, cols=cols,
                        specs=[[{"type": "domain"} for i in range(cols)] for j in range(rows)])
    for i, column_name in enumerate(columns_names):
        df_not_null = df[~df[column_name].isnull()]
        counts = df_not_null.groupby([column_name]).size().reset_index(name='count')
        fig.add_trace(go.Pie(labels=counts[column_name],
                             values=counts['count'],
                             name=column_name),
                      i // cols + 1, i % cols + 1)
    fig.update_layout(margin=dict(t=10, l=10, r=10, b=10))
    fig.show()
create_pie_chart_subplot_of_count(insurance_df, ['sex', 'region', 'smoker', 'children'])
We can show the inner distribution with sunburst charts.
We can show it even for the non-categorical features: we can limit the depth of the chart and put the non-categorical features at the end of the chain.
# create sunburst charts of the features
insurance_df_cp2 = insurance_df.copy()
insurance_df_cp2.insert(len(insurance_df_cp2.columns), "count", 1, True)
fig = px.sunburst(insurance_df_cp2, path=['children', 'smoker', 'sex', 'region'], values='count')
fig.update_layout(margin=dict(t=10, l=10, r=10, b=10))
fig.show()
fig = px.sunburst(insurance_df_cp2, path=['children', 'smoker', 'sex', 'region', 'age', 'bmi'], values='count', maxdepth=2)
fig.update_layout(margin=dict(t=10, l=10, r=10, b=10))
fig.show()
In general, we can randomly pick one of the existing values for categorical features, and pick the mean or median for numerical features (but this won't always be the best way).
# fill empty values in the dataframe
def fill_na_median(df, column_name):
    df[column_name].fillna(df[column_name].median(), inplace=True)
def fill_na_mean(df, column_name):
    df[column_name].fillna(df[column_name].mean(), inplace=True)
def fill_na_random_pick_column_distribution(df, column_name):
    df_not_null = df[~df[column_name].isnull()]
    df[column_name] = df[column_name].apply(
        lambda x: np.random.choice(df_not_null[column_name]) if pd.isnull(x) else x)
fill_na_median(insurance_df_cp, 'age')
fill_na_mean(insurance_df_cp, 'bmi')
fill_na_mean(insurance_df_cp, 'charges')
fill_na_random_pick_column_distribution(insurance_df_cp, 'region')
fill_na_random_pick_column_distribution(insurance_df_cp, 'children')
fill_na_random_pick_column_distribution(insurance_df_cp, 'smoker')
fill_na_random_pick_column_distribution(insurance_df_cp, 'sex')
insurance_df_cp
# check for empty values
count_empty_values_in_each_column(insurance_df_cp)
We can see that there are no more empty values.
Some machine learning algorithms (like logistic and linear regression) can not work with categorical features and may only work with numerical or ordinal features.
The next step in preparing the dataset for model learning is converting the categorical features into numerical features.
There are a few ways of doing that:
| old_region_column | new_region_column |
|---|---|
| southwest | 0 |
| northwest | 1 |
| southeast | 2 |
| northeast | 3 |
| old_region_column | new_southwest_column | new_northwest_column | new_southeast_column | new_northeast_column |
|---|---|---|---|---|
| southwest | 1 | 0 | 0 | 0 |
| northwest | 0 | 1 | 0 | 0 |
| southeast | 0 | 0 | 1 | 0 |
| northeast | 0 | 0 | 0 | 1 |
| old_region_column | new_southwest_column | new_northwest_column | new_southeast_column |
|---|---|---|---|
| southwest | 1 | 0 | 0 |
| northwest | 0 | 1 | 0 |
| southeast | 0 | 0 | 1 |
| northeast | 0 | 0 | 0 |
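The first table above (each category mapped to an integer, i.e. label encoding) can be reproduced with a simple pandas mapping — a quick sketch before we move on to one-hot encoding:

```python
import pandas as pd

# hypothetical region values, as in the tables above
regions = pd.Series(['southwest', 'northwest', 'southeast', 'northeast'])

# label encoding: map every category to an integer (the order is arbitrary)
mapping = {'southwest': 0, 'northwest': 1, 'southeast': 2, 'northeast': 3}
encoded = regions.map(mapping)
print(list(encoded))  # [0, 1, 2, 3]
```

Note that label encoding invents an order between the regions that doesn't really exist, which is exactly why we prefer one-hot/dummy encoding for categorical features.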
We will use Scikit-learn OneHotEncoder.
We will use it as a Dummy Encoder.
# dummy encode the categorical variables in the df
from sklearn.preprocessing import OneHotEncoder
insurance_df_cat = insurance_df_cp[['sex', 'smoker', 'region']]
enc = OneHotEncoder(drop='first', sparse=False)
insurance_df_cat_enc = pd.DataFrame(enc.fit_transform(insurance_df_cat))
insurance_df_cp_enc = insurance_df_cp.drop(['sex', 'smoker', 'region'], axis=1).join(insurance_df_cat_enc)
insurance_df_cp_enc
We can see that the sex column has been converted to 1 binary column, the smoker column to 1 binary column, and the region column to 3 binary columns.
We can create a method to do this task.
# dummy encode the categorical variables in the df with method
def dummy_encode(df, columns_names):
    df_cat = df[columns_names]
    enc = OneHotEncoder(drop='first', sparse=False)
    df_cat_enc = pd.DataFrame(enc.fit_transform(df_cat))
    df_enc = df.drop(columns_names, axis=1).join(df_cat_enc)
    return df_enc
insurance_df_cp_enc2 = dummy_encode(insurance_df_cp, ['sex', 'smoker', 'region'])
insurance_df_cp_enc2
We can also use Pandas get_dummies method in one line, and even attach names to the new columns.
# dummy encode the categorical variables in the df with get_dummies
insurance_df_dum = pd.get_dummies(insurance_df_cp, columns=['sex', 'smoker', 'region'], prefix=["sex_type_is", "smoker_type_is", "region_type_is"], drop_first=True)
insurance_df_dum
The difference between the get_dummies approach and the OneHotEncoder approach is that OneHotEncoder can transform several datasets with the same encoding (if we have, for example, train and test), while get_dummies only converts one dataframe at a time (so the same column may get different encodings in different datasets).
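A short sketch of this difference, with hypothetical mini train/test frames (we omit the sparse argument and call .toarray() instead, which works across sklearn versions): fitting the encoder once on train and reusing it keeps the columns aligned, even when the test frame is missing categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical mini frames; the test frame is missing 3 of the 4 regions
train = pd.DataFrame({'region': ['southwest', 'northwest', 'southeast', 'northeast']})
test = pd.DataFrame({'region': ['southwest', 'southwest']})

# fit once on train, then reuse the same encoding for both frames
enc = OneHotEncoder(drop='first')
enc.fit(train)
train_enc = enc.transform(train).toarray()
test_enc = enc.transform(test).toarray()
print(train_enc.shape)  # (4, 3) -- 4 categories, first one dropped
print(test_enc.shape)   # (2, 3) -- same 3 columns, aligned with train

# get_dummies only sees the categories of the frame it is given
dummies = pd.get_dummies(test['region'])
print(dummies.shape)    # (2, 1) -- a single column, misaligned with train
```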
In real-life scenarios, we don't have the test data.
We can not check the performance of the model on the same dataset that the model was trained on.
This would result in a wrong estimate of the model's generalization capabilities.
In order to check our prediction and fine-tune the model parameters, we need to slice the dataset into 2 groups:
We will train on the train data and check the performance on the validation data.
We will slice the dataset with Scikit-learn train_test_split.
First, let's split the data to features X and target t.
# divide the data to features and target
t = insurance_df_cp_enc['charges'].copy()
X = insurance_df_cp_enc.drop(['charges'], axis=1)
print('t')
display(t)
print()
print('X')
display(X)
Now, we can split the data to train and validation.
We can choose different values for the test_size argument.
Let's check a few of them with the NE (Normal Equations) regression model, measuring MSE and R2.
We can plot the data with Plotly scatter.
# print 4 graphs: MSE of train/val and R2 of train/val
def print_graphs_r2_mse(graph_points):
    for k, v in graph_points.items():
        best_value = max(v.values()) if 'R2' in k else min(v.values())
        best_index = np.argmax(list(v.values())) if 'R2' in k else np.argmin(list(v.values()))
        best_x = list(v.keys())[best_index]
        color = 'red' if 'train' in k else 'blue'
        fig = px.scatter(x=list(v.keys()), y=list(v.values()),
                         title=f'{k}, best value: x={best_x}, y={best_value}',
                         color_discrete_sequence=[color])
        fig.data[0].update(mode='markers+lines')
        fig.show()
# plot the score by split and the loss by split
def plot_score_and_loss_by_split(X, t):
    graph_points = {
        'train_MSE': {},
        'val_MSE': {},
        'train_R2': {},
        'val_R2': {}
    }
    for size in range(10, 100, 10):
        X_train, X_val, t_train, t_val = model_selection.train_test_split(X, t, test_size=size/100, random_state=42)
        NE_reg = linear_model.LinearRegression().fit(X_train, t_train)
        y_train = NE_reg.predict(X_train)
        y_val = NE_reg.predict(X_val)
        graph_points['train_MSE'][size/100] = metrics.mean_squared_error(t_train, y_train)
        graph_points['val_MSE'][size/100] = metrics.mean_squared_error(t_val, y_val)
        graph_points['train_R2'][size/100] = NE_reg.score(X_train, t_train)
        graph_points['val_R2'][size/100] = NE_reg.score(X_val, t_val)
    print_graphs_r2_mse(graph_points)
plot_score_and_loss_by_split(X, t)
We can see that when the validation set is small, its loss is small.
One explanation for this is that it is easier for a linear hypothesis to match a small group of samples.
We can see that from 0.1 to 0.3, the validation loss is smaller than the train loss, and from 0.4 to 0.9 the validation loss is larger than the train loss.
So, let's give the validation group 35% of the dataset; it is about the point where the validation loss is equal to the train loss.
# split the data to train and validation
X_train, X_val, t_train, t_val = model_selection.train_test_split(X, t, test_size=0.35, random_state=1)
print('X_train')
display(X_train)
print()
print('t_train')
display(t_train)
print()
print('X_val')
display(X_val)
print()
print('t_val')
display(t_val)
We can see that the data has been split randomly into train and validation (X and t are split on the same indices).
Let's train the NE model on the data and print the MSE and R2 graphs of the train and validation sets.
Let's use Scikit-learn PolynomialFeatures, to raise the degree of the model.
This transformer adds more features: polynomial combinations of the original features.
Example:
If we have the features [a, b] and we want to raise it to the 2nd degree, we get [1, a, b, a^2, ab, b^2].
We can choose not to include the intercept with the include_bias option, and get [a, b, a^2, ab, b^2].
A linear model trained on these features is like a polynomial model trained on the original features.
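A tiny numeric check of the [a, b] example above, with a single made-up sample [a, b] = [2, 3]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2, 3]])  # one sample with features [a, b]

# 2nd degree, without the intercept column of ones
pol = PolynomialFeatures(2, include_bias=False)
result = pol.fit_transform(sample)
print(result)  # [[2. 3. 4. 6. 9.]] -> [a, b, a^2, ab, b^2]
```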
Let's see how it is done on 2 features, age and bmi.
# add 2nd degree features to `age` and `bmi` features
print('original features')
display(X_train[['age', 'bmi']])
pol = preprocessing.PolynomialFeatures(2, include_bias=False)
print()
print('altered features')
pd.DataFrame(pol.fit_transform(X_train[['age', 'bmi']]), columns=['age', 'bmi', 'age^2', 'age*bmi', 'bmi^2'])
We can see that the age^2, age*bmi and bmi^2 columns were added.
Let's train the NE model with a few degrees and see which degree is best on the age feature.
# plot the score by degree and the loss by degree
def plot_score_and_loss_by_degree(X_train, t_train, X_val, t_val):
    graph_points = {
        'train_MSE': {},
        'val_MSE': {},
        'train_R2': {},
        'val_R2': {}
    }
    st_scalar = preprocessing.StandardScaler().fit(X_train)
    X_train = st_scalar.transform(X_train)
    X_val = st_scalar.transform(X_val)
    max_degree_of_features = 20
    for degree in range(1, max_degree_of_features):
        NE_reg = pipeline.make_pipeline(preprocessing.PolynomialFeatures(degree, include_bias=False),
                                        linear_model.LinearRegression())
        NE_reg.fit(X_train, t_train)
        y_train = NE_reg.predict(X_train)
        y_val = NE_reg.predict(X_val)
        graph_points['train_MSE'][degree] = metrics.mean_squared_error(t_train, y_train)
        graph_points['val_MSE'][degree] = metrics.mean_squared_error(t_val, y_val)
        graph_points['train_R2'][degree] = NE_reg.score(X_train, t_train)
        graph_points['val_R2'][degree] = NE_reg.score(X_val, t_val)
    print_graphs_r2_mse(graph_points)
plot_score_and_loss_by_degree(X_train[['age']], t_train, X_val[['age']], t_val)
We can see that the MSE loss on the train set is smaller than on the validation set (meaning the model performs better on the train data).
This is what will happen most of the time (but not all of the time).
The train loss is going down as the complexity of the model gets higher.
The validation loss goes down at first, until it reaches its minimum at degree 3, and then it starts going up as the complexity of the model gets higher.
The case where the train and validation/test losses go down together is called High Bias.
It means that the model is not expressive enough and we need to make it more complex to help it learn the data better.
The case where the train loss goes down while the validation/test loss goes up is called High Variance.
It means that the model is fitted too closely to the train data, and we need to make the model less complex (lower the degree).

We can see the same phenomenon (in the opposite direction) in the R2 score graphs.
On High Bias the R2 score graph of the train and validation/test will go up together.
On High Variance the R2 score graph of the train will go up, and the R2 graph of the validation/test will go down.
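A minimal synthetic illustration of the two regimes (toy data, not our insurance set; the cubic ground truth, noise level and seed are arbitrary choices): a degree-1 fit underfits the curve (high bias), while a degree-15 fit drives the train loss down but lets the validation loss grow (high variance).

```python
import numpy as np
from sklearn import linear_model, metrics, pipeline, preprocessing
from sklearn.model_selection import train_test_split

# toy data: a noisy cubic curve
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
t = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=2.0, size=60)
X_tr, X_va, t_tr, t_va = train_test_split(X, t, test_size=0.3, random_state=0)

results = {}
for degree in (1, 3, 15):
    model = pipeline.make_pipeline(
        preprocessing.PolynomialFeatures(degree, include_bias=False),
        linear_model.LinearRegression(),
    ).fit(X_tr, t_tr)
    results[degree] = (
        metrics.mean_squared_error(t_tr, model.predict(X_tr)),  # train MSE
        metrics.mean_squared_error(t_va, model.predict(X_va)),  # validation MSE
    )
    print(f'degree={degree:2d}  train MSE={results[degree][0]:8.2f}  val MSE={results[degree][1]:8.2f}')
```

The train MSE keeps shrinking as the degree grows, while the validation MSE improves up to the true degree (3) and then deteriorates.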
Another thing we can see in the line graphs of the age feature is that the R2 score is very low.
This feature alone is not enough to predict the charges target.
If we want to see the regression hypothesis in a graph, we need to choose 1 feature for a 2D graph or 2 features for a 3D graph.
Let's plot a 2D graph with Plotly Scatter, with the bmi feature on the x axis.
The target charges will be on the y axis.
Let's add the regression line (with NumPy linspace).
We can add a slider for different degrees with Plotly Slider.
# plot the samples by bmi, with the regression line
def plot_samples_with_regression_line(df):
    linspace_size = 1000
    margin = 0
    X_part = df[['bmi']]
    t = df['charges']
    x_min, x_max = X_part.bmi.min() - margin, X_part.bmi.max() + margin
    xrange = pd.DataFrame(np.linspace(x_min, x_max, linspace_size), columns=['bmi'])
    fig = go.Figure()
    for degree in np.arange(1, 50):
        NE_reg = pipeline.make_pipeline(preprocessing.PolynomialFeatures(degree, include_bias=False),
                                        linear_model.LinearRegression()).fit(X_part, t)
        pred = NE_reg.predict(xrange)
        fig.add_trace(go.Scatter(x=X_part['bmi'], y=t, mode='markers', visible=False, name="original data points"))
        fig.add_trace(go.Scatter(x=xrange['bmi'], y=pred, mode='lines', name="line degree = " + str(degree), visible=False))
    fig.data[0].visible = True
    fig.data[1].visible = True
    steps = []
    for i in range(len(fig.data) // 2):
        step = dict(
            method="update",
            args=[{"visible": [False] * len(fig.data)},
                  {"title": f"Slider switched to degree: {i + 1}"}],
            label=i + 1
        )
        step["args"][0]["visible"][i * 2] = True
        step["args"][0]["visible"][i * 2 + 1] = True
        steps.append(step)
    sliders = [dict(
        active=0,
        currentvalue={"prefix": "Degree: "},
        steps=steps
    )]
    fig.update_layout(sliders=sliders)
    fig.show()
plot_samples_with_regression_line(insurance_df_cp)
# plot the samples by age and bmi, with the regression surface
def plot_samples_with_regression_surface(df):
    mesh_size = 1
    margin = 0
    X_part = df[['age', 'bmi']]
    t = df['charges']
    x_min, x_max = X_part.age.min() - margin, X_part.age.max() + margin
    y_min, y_max = X_part.bmi.min() - margin, X_part.bmi.max() + margin
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    fig = go.Figure()
    for degree in np.arange(1, 20):
        NE_reg = pipeline.make_pipeline(preprocessing.PolynomialFeatures(degree, include_bias=False),
                                        linear_model.LinearRegression()).fit(X_part, t)
        pred = NE_reg.predict(np.c_[xx.ravel(), yy.ravel()])
        pred = pred.reshape(xx.shape)
        fig.add_trace(go.Scatter3d(x=X_part['age'], y=X_part['bmi'], z=t, mode='markers', visible=False, name="original data points"))
        fig.add_trace(go.Surface(x=xrange, y=yrange, z=pred, name="surface degree = " + str(degree), visible=False))
    fig.data[0].visible = True
    fig.data[1].visible = True
    steps = []
    for i in range(len(fig.data) // 2):
        step = dict(
            method="update",
            args=[{"visible": [False] * len(fig.data)},
                  {"title": f"Slider switched to degree: {i + 1}"}],
            label=i + 1
        )
        step["args"][0]["visible"][i * 2] = True
        step["args"][0]["visible"][i * 2 + 1] = True
        steps.append(step)
    sliders = [dict(
        active=0,
        currentvalue={"prefix": "Degree: "},
        steps=steps
    )]
    fig.update_layout(sliders=sliders)
    fig.show()
plot_samples_with_regression_surface(insurance_df_cp)
Explanation on the difference between Matplotlib, Seaborn and Plotly:
Matplotlib vs. Seaborn vs. Plotly
Post on how to clean datasets using Pandas:
How To Clean Machine Learning Datasets Using Pandas
Explanation on the difference between scatterplot and dotplot:
Difference between scatter-plot and a dotplot
Tutorial on how to use bar charts with Plotly Express:
Step by step bar-charts using Plotly Express
Explanation on the differences between Categorical, Ordinal and Numerical variables:
What is the Difference Between Categorical Ordinal and Numerical Variables?
Explanation on why it is important to define correctly categorical and ordinal features:
Categorical and ordinal feature data representation in regression analysis?
A package for regression tasks on ordinal target:
mord: Ordinal Regression in Python
Explanation on how to predict empty values:
Predict Missing Values in the Dataset
Explanation on the differences between label encoding, one-hot encoding and dummy encoding:
One-Hot Encoding vs. Label Encoding using Scikit-Learn
Wikipedia on multicollinearity:
Multicollinearity
Tutorial on how to use label encoding and one-hot encoding:
Categorical encoding using Label-Encoding and One-Hot-Encoder
A post on the Bias-Variance Decomposition:
Bias-Variance Decomposition
Examples of plots in Plotly that are best for ML Regression:
ML Regression in Python
Documentation of Plotly sliders:
Python Figure Reference: layout.sliders
How to change default control values in Plotly sliders:
Python: Change Custom Control Values in Plotly
An explanation on some rare train/test scenarios:
How is it possible to obtain better results on the test set than on the training set?